Extracting Multiword Terms from Document Collections
Abstract
Multiword terms (MWTs) are relevant strings of words in text collections. Once automatically extracted, they can be used by an Information Retrieval system to suggest to its users conceptually interesting refinements of their information needs. These multiword terms point to relevant information, often corresponding to topics and subtopics in the text collection, and may be quite useful, especially for substantially refining generic queries. In this paper, we introduce the LocalMaxs algorithm for automatically extracting multiword terms. This algorithm requires neither empirically suggested thresholds, nor complex linguistic filters, nor language-specific morpho-syntactic rules. These features make it a suitable approach for extracting MWTs from text collections written in any language. Moreover, by introducing the Fair Dispersion Point Normalization concept, we can deal with arbitrarily long MWTs and can compare the results obtained by using different word association measures for MWT selection. We also introduce our own association measure, the SCP, to work with the LocalMaxs algorithm, and assess the results obtained by comparing it with related statistics-based measures (Specific Mutual Information, Dice and Loglike coefficients) used in experiments on a text collection. An Information Retrieval application using our approach is also presented.
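The abstract names the LocalMaxs algorithm, the SCP measure and the Fair Dispersion Point Normalization but gives no formulas. The following is a minimal Python sketch of how the three could fit together, under assumptions that do not come from the abstract: SCP of an n-gram is taken as p(w1..wn)^2 divided by the average of p(w1..wi) * p(wi+1..wn) over all n-1 split points; an n-gram is selected when its score is at least that of both contained (n-1)-grams and strictly greater than that of every (n+1)-gram containing it; only contiguous n-grams are considered; and probabilities are estimated as count divided by the number of tokens. The function names `ngram_counts`, `scp_f` and `local_maxs` are hypothetical.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def ngram_counts(tokens, max_n):
    """Count every contiguous n-gram of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def scp_f(gram, counts, total):
    """Assumed SCP with fair-dispersion normalization.

    p(gram)^2 divided by the mean of p(left) * p(right) over all
    split points of the n-gram. Exact fractions avoid rounding ties.
    """
    def p(g):
        return Fraction(counts[g], total)

    splits = [p(gram[:i]) * p(gram[i:]) for i in range(1, len(gram))]
    avg = sum(splits) / len(splits)
    return p(gram) ** 2 / avg


def local_maxs(tokens, max_n=3):
    """Select n-grams (2 <= n <= max_n) whose score is a local maximum."""
    counts = ngram_counts(tokens, max_n + 1)
    total = len(tokens)
    glue = {g: scp_f(g, counts, total) for g in counts if len(g) >= 2}

    # Map each n-gram to the scores of the (n+1)-grams that contain it.
    supers = defaultdict(list)
    for g, value in glue.items():
        if len(g) >= 3:
            supers[g[:-1]].append(value)
            supers[g[1:]].append(value)

    selected = []
    for g, value in glue.items():
        if len(g) > max_n:
            continue  # longer n-grams only serve as context
        # Bigrams have no scored sub-grams, so only the second test applies.
        subs = [glue[s] for s in (g[:-1], g[1:]) if len(s) >= 2]
        if all(value >= s for s in subs) and all(value > s for s in supers[g]):
            selected.append(g)
    return selected


tokens = "x a b y a b z a b w".split()
print(local_maxs(tokens, max_n=3))
```

In this toy sequence the pair ("a", "b") always occurs together, so it scores higher than every trigram around it and is the only n-gram selected. No frequency or score threshold is involved, which is the property the abstract emphasizes. Dividing by the average over all split points is what lets scores of n-grams of different lengths be compared directly.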
Similar resources
Extracting Multiwords From Large Document Collection Based N-Gram
Multiword terms (MWTs) are relevant strings of words in text collections. Once automatically extracted, they can be used by an Information Retrieval system to suggest to its users conceptually interesting refinements of their information needs. These multiword terms point to relevant information, often corresponding to topics and subtopics in the text collecti...
Combining Linguistics with statistics for multiword term extraction: a fruitfull association?
The acquisition of multiword terms from large text collections is a fundamental issue in the context of Information Retrieval. Indeed, their identification leads to improvements in the indexing process and allows guiding the user in his search for information. In this paper, we present an original methodology that allows extracting multiword terms by either (1) exclusively considering statistic...
Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations
Automatic acquisition of information structures like Topic Maps or semantic networks from large document collections is an important issue in knowledge management. An inherent problem with automatic approaches is the treatment of multiword terms as single semantic entities. Taking company names as an example, we present a method for learning multiword terms from large text corpora expl...
Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations
Automatic acquisition of information structures like Topic Maps or semantic networks from large document collections is an important issue in knowledge management. An inherent problem with automatic approaches is the treatment of multiword terms as single semantic entities. Taking company names as an example, we present a method for learning multiword terms from large text corpora exploiting th...
Extracting Multiword Translations from Aligned Comparable Documents
Most previous attempts to identify translations of multiword expressions using comparable corpora relied on dictionaries of single words. The translation of a multiword was then constructed from the translations of its components. In contrast, in this work we try to determine the translation of a multiword unit by analyzing its contextual behaviour in aligned comparable documents, thereby not p...
Publication date: 1999